CS273A Homework 5¶

Due: Wednesday Nov 19 2025 (11:59pm)¶


Instructions¶

This homework (and subsequent ones) will involve data analysis and reporting on methods and results using Python code. You will submit a single PDF file that contains everything to Gradescope. This includes any text you wish to include to describe your results, the complete code snippets of how you attempted each problem, any figures that were generated, and scans of any work on paper that you wish to include. It is important that you include enough detail that we know how you solved the problem, since otherwise we will be unable to grade it.

Your homeworks will be given to you as Jupyter notebooks containing the problem descriptions and some template code that will help you get started. You are encouraged to use these starter Jupyter notebooks to complete your assignment and to write your report. This will help you not only ensure that all of the code for the solutions is included, but also will provide an easy way to export your results to a PDF file (for example, doing print preview and printing to pdf). I recommend liberal use of Markdown cells to create headers for each problem and sub-problem, explaining your implementation/answers, and including any mathematical equations. For parts of the homework you do on paper, scan it in such that it is legible (there are a number of free Android/iOS scanning apps, if you do not have access to a scanner), and include it as an image in the Jupyter notebook.

Double check that all of your answers are legible on Gradescope, e.g. make sure any text you have written does not get cut off.

If you have any questions/concerns about using Jupyter notebooks, ask us on EdD. If you decide not to use Jupyter notebooks, but go with Microsoft Word or LaTeX to create your PDF file, make sure that all of the answers can be generated from the code snippets included in the document.

Summary of Assignment: 100 total points¶

  • Problem 1: Decision Trees by Hand (25 points)
    • Problem 1.1: Shannon Entropy (5 points)
    • Problem 1.2: Information Gain (10 points)
    • Problem 1.3: Full Tree (10 points)
  • Problem 2: Decision Trees in Python (34 points)
    • Problem 2.1: Feature Statistics (8 points)
    • Problem 2.2: Initial Tree (6 points)
    • Problem 2.3: Exploring Depth Control (10 points)
    • Problem 2.4: Exploring Leaf Size (10 points)
  • Problem 3: Random Forests (20 points)
    • Problem 3.1: Training Members (10 points)
    • Problem 3.2: Ensemble Prediction (10 points)
  • Problem 4: VC Dimension (16 points)
    • Problem 4.1: Model A (4 points)
    • Problem 4.2: Model B (4 points)
    • Problem 4.3: Model C (4 points)
    • Problem 4.4: Model D (4 points)
  • Statement of Collaboration (5 points)

Before we get started, let's import some libraries that you will make use of in this assignment. Make sure that you run the code cell below in order to import these libraries.

Important: In the code block below, we set seed=1234. This is to ensure your code has reproducible results and is important for grading. Do not change this. If you are not using the provided Jupyter notebook, make sure to also set the random seed as below.

Important: Do not change any of the code we give you below, except for the parts marked for you to complete. This is to ensure your code has reproducible results and is important for grading.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

import requests                                      # reading data
from io import StringIO

from sklearn.datasets import fetch_openml            # common data set access
from sklearn.preprocessing import StandardScaler     # scaling transform
from sklearn.model_selection import train_test_split # validation tools
from sklearn.metrics import zero_one_loss as J01

import sklearn.tree as tree

# Fix the random seed for reproducibility
# !! Important !! : do not change this
seed = 1234
np.random.seed(seed)

Problem 1: Decision Trees for Spam¶

In order to reduce my email load, I decide to implement a machine learning algorithm to decide whether or not I should read an email, or simply file it away instead. To train my model, I obtain the following data set of binary-valued features about each email, including whether I know the author or not, whether the email is long or short, and whether it has any of several key words, along with my final decision about whether to read it ($y=+1$ for "read", $y=-1$ for "discard").

| X1 (know author?) | X2 (is long?) | X3 (has 'research'?) | X4 (has 'grade'?) | X5 (has 'lottery'?) | y (read?) |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | -1 |
| 1 | 1 | 0 | 1 | 0 | -1 |
| 0 | 1 | 1 | 1 | 1 | -1 |
| 1 | 1 | 1 | 1 | 0 | -1 |
| 0 | 1 | 0 | 0 | 0 | -1 |
| 1 | 0 | 1 | 1 | 1 | +1 |
| 0 | 0 | 1 | 0 | 0 | +1 |
| 1 | 0 | 0 | 0 | 0 | +1 |
| 1 | 0 | 1 | 1 | 0 | +1 |
| 1 | 1 | 1 | 1 | 1 | -1 |

In the case of any ties where both classes have equal probability, we will prefer to predict class $+1$.

Solve the following problems "by hand" (you can use python for logarithms, etc.)

Problem 1.1 (5 points)¶

Calculate the Shannon entropy $H(y)$ of the binary class variable $y$, in bits. Hint: Your answer should be a number between 0 and 1.

[Scanned handwritten solution: Screenshot 2025-11-23 010500.png]
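As a numerical sanity check on the hand calculation (this snippet is only a sketch, not part of the required by-hand work), the entropy of the ten labels can be computed directly. With 4 emails read and 6 discarded, $H(y) = -0.4\log_2 0.4 - 0.6\log_2 0.6 \approx 0.971$ bits:

```python
import numpy as np

# Labels y from the table, in row order: 6 discarded (-1), 4 read (+1)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, -1])

p = np.mean(y == 1)                          # P(y = +1) = 0.4
H = -p*np.log2(p) - (1 - p)*np.log2(1 - p)   # Shannon entropy in bits
print(f"H(y) = {H:.4f}")                     # ~0.9710
```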

Problem 1.2 (10 points)¶

Calculate the information gain for each feature $x_i$. Which feature should be split first?

[Scanned handwritten solution: Screenshot 2025-11-23 010628.png, Screenshot 2025-11-23 010941.png]
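To verify the hand computation (again, a sketch rather than part of the required by-hand work), the information gain of each feature is $IG(X_i) = H(y) - \sum_v P(X_i{=}v)\,H(y \mid X_i{=}v)$, which can be evaluated directly from the table:

```python
import numpy as np

# Data table from Problem 1: columns X1..X5, last column is y
D = np.array([
    [0,0,1,1,0,-1],
    [1,1,0,1,0,-1],
    [0,1,1,1,1,-1],
    [1,1,1,1,0,-1],
    [0,1,0,0,0,-1],
    [1,0,1,1,1, 1],
    [0,0,1,0,0, 1],
    [1,0,0,0,0, 1],
    [1,0,1,1,0, 1],
    [1,1,1,1,1,-1],
])
X, y = D[:, :-1], D[:, -1]

def entropy(labels):
    """Shannon entropy (bits) of a +/-1 label vector."""
    p = np.mean(labels == 1)
    if p in (0.0, 1.0):
        return 0.0
    return -p*np.log2(p) - (1 - p)*np.log2(1 - p)

H_y = entropy(y)
gains = []
for i in range(X.shape[1]):
    # Weighted conditional entropy over the two feature values
    H_cond = sum(np.mean(X[:, i] == v) * entropy(y[X[:, i] == v])
                 for v in (0, 1))
    gains.append(H_y - H_cond)
    print(f"IG(X{i+1}) = {gains[-1]:.4f}")

# X2 ("is long?") has the largest gain (~0.61), so it is split first
```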

Problem 1.3 (10 points)¶

Draw (or otherwise illustrate) the complete decision tree that will be learned from these data.

[Scanned handwritten solution: Screenshot 2025-11-23 010721.png]

Problem 2: Decision trees in Python¶

In the next problem, we will use decision trees to predict a data set used for Kaggle competitions in older iterations of the course; see e.g., https://www.kaggle.com/c/uc-irvine-cs273a-2016

Note that I have altered the data from that competition to make the target variables +1 and -1, instead of 0/1, to better match some of the ensemble slides.

First, let's load the data:

In [2]:
url = 'https://www.ics.uci.edu/~ihler/classes/cs273/data/precip_train.txt'

with requests.get(url, verify=False) as link:
  data = np.genfromtxt(StringIO(link.text),delimiter=None)

X,Y = data[:,:-1], data[:,-1]

Problem 2.1¶

Compute and print some useful statistics about the data -- the minimum, maximum, mean, and variance of each of the features.

In [4]:
# Stats of each feature

for i in range(X.shape[1]):
    print(f"Feature X{i+1}: min={X[:,i].min():.4f}, max={X[:,i].max():.4f}, mean={X[:,i].mean():.4f}, var={X[:,i].var():.4f}")
Feature X1: min=197.0000, max=253.0000, mean=241.8990, var=81.1988
Feature X2: min=190.0000, max=248.0000, mean=228.3813, var=89.1503
Feature X3: min=214.9700, max=252.0200, mean=241.9059, var=34.5577
Feature X4: min=205.4200, max=252.0200, mean=233.8254, var=94.5072
Feature X5: min=10.0000, max=17130.0000, mean=2849.0465, var=10505588.3006
Feature X6: min=0.0000, max=12338.0000, mean=862.8611, var=3090415.2075
Feature X7: min=0.0000, max=9238.0000, mean=163.6526, var=698073.3557
Feature X8: min=0.0000, max=27.4190, mean=3.0558, var=7.2769
Feature X9: min=1.2189, max=18.1070, mean=6.3114, var=6.1830
Feature X10: min=0.0000, max=11.3680, mean=1.8939, var=4.1509
Feature X11: min=0.0000, max=18.7710, mean=4.2896, var=3.9446
Feature X12: min=0.0000, max=14.7450, mean=2.7978, var=1.9323
Feature X13: min=1.0271, max=278.7100, mean=10.4525, var=170.0018
Feature X14: min=-999.9000, max=769.2000, mean=7.6581, var=1528.9474

In past assignments, we have first normalized the data (subtracting the mean and scaling each feature). We will not do that here, however. Why is that step unnecessary for our decision tree model?

In [ ]:
# Normalization is unnecessary for decision trees: each split compares a single feature
# against a threshold, and such comparisons depend only on the ordering of the feature's
# values, not on their scale or offset. Rescaling the data just rescales the thresholds.
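A quick empirical check of this claim (a sketch with synthetic data, not part of the assignment): fitting a tree on raw and on standardized features yields identical predictions, because standardization is a strictly increasing per-feature transform and therefore preserves every threshold split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
Xd = rng.normal(size=(200, 3))                    # synthetic continuous features
yd = (Xd[:, 0] + 0.5*Xd[:, 1] > 0).astype(int)    # simple linear-ish labels

raw = DecisionTreeClassifier(random_state=0).fit(Xd, yd)
Xs = StandardScaler().fit_transform(Xd)
scl = DecisionTreeClassifier(random_state=0).fit(Xs, yd)

# Same partitions, just rescaled thresholds -> identical predictions
print((raw.predict(Xd) == scl.predict(Xs)).all())
```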

Problem 2.2¶

To allow faster experimentation, select the first 10,000 data points as training data, Xtr, Ytr, and the next 10,000 data points as validation, Xva, Yva. Then, learn a decision tree classifier on Xtr using the DecisionTreeClassifier class from scikit-learn. Use a large depth, say max_depth=100, and the Shannon entropy impurity function in your training. Also, use random_state=seed to ensure consistency.

Compute and print out the training error rate and validation error rate of your model.

In [5]:
# data split
X_tr, Y_tr = X[:10000], Y[:10000]
X_val, Y_val = X[10000:20000], Y[10000:20000]

# decision tree training using shannon entropy
dt = tree.DecisionTreeClassifier(max_depth=100, criterion='entropy', random_state=seed)
dt.fit(X_tr, Y_tr)

# training and validation error
train_err = J01(Y_tr, dt.predict(X_tr))
val_err = J01(Y_val, dt.predict(X_val))
print(f"Training error: {train_err:.4f}")
print(f"Validation error: {val_err:.4f}")
Training error: 0.0098
Validation error: 0.3840

Problem 2.3¶

Now try varying the max_depth parameter, which forces the tree learning algorithm to stop after at most that many levels. Test max_depth values in the range {1, 2, 3, ..., 16}, and plot the training and validation error rates versus max_depth. Do models with higher max_depth have higher or lower complexity? What choice of max_depth provides the best decision tree model?

In [ ]:
max_depths = range(1, 17)
train_errs = []
val_errs = []

for depth in max_depths:
    dt = tree.DecisionTreeClassifier(max_depth=depth, criterion='entropy', random_state=seed)
    dt.fit(X_tr, Y_tr)
    train_errs.append(J01(Y_tr, dt.predict(X_tr)))
    val_errs.append(J01(Y_val, dt.predict(X_val)))

plt.figure(figsize=(10,6))
plt.plot(max_depths, train_errs, label='Training Error')
plt.plot(max_depths, val_errs, label='Validation Error')
plt.xlabel('Max Depth')
plt.ylabel('Error Rate')
plt.legend()
plt.grid()
plt.show()

print("Models with higher max_depth have higher complexity: deeper trees can make more splits and so fit finer patterns in the training data.")
print("The best max_depth is the one with the lowest validation error; based on the plot, that is max_depth = 4.")
Models with higher max_depth have higher complexity: deeper trees can make more splits and so fit finer patterns in the training data.
The best max_depth is the one with the lowest validation error; based on the plot, that is max_depth = 4.

Problem 2.4¶

The min_samples_leaf parameter controls the complexity of decision trees by lower bounding the amount of data required to split nodes when learning. Fixing max_depth=100, compute and plot the training and validation error rates for min_samples_leaf values in the range {2**0, 2**1, 2**2, ... 2**17}. Do models with higher min_samples_leaf have higher or lower complexity? What choice of min_samples_leaf provides the best decision tree model? Is this model better or worse than your depth-controlled model?

In [ ]:
min_samples_leaf = [2**i for i in range(18)]
train_errs = []
val_errs = []

for i in min_samples_leaf:
    dt = tree.DecisionTreeClassifier(max_depth=100, min_samples_leaf=i, criterion='entropy', random_state=seed)
    dt.fit(X_tr, Y_tr)
    train_errs.append(J01(Y_tr, dt.predict(X_tr)))
    val_errs.append(J01(Y_val, dt.predict(X_val)))

plt.figure(figsize=(10,6))
plt.plot(min_samples_leaf, train_errs, label='Training Error')
plt.plot(min_samples_leaf, val_errs, label='Validation Error', marker='o')
plt.xlabel('Min Samples Leaf')
plt.ylabel('Error Rate')
plt.xscale('log', base=2)
plt.legend()
plt.grid()
plt.show()

print("Models with higher min_samples_leaf have lower complexity: requiring more samples in each leaf limits how many splits the tree can make.")
print("The best min_samples_leaf is the one with the lowest validation error; based on the plot, that is min_samples_leaf = 2^8 = 256.")
print("This model is worse than the depth-controlled model.")
Models with higher min_samples_leaf have lower complexity: requiring more samples in each leaf limits how many splits the tree can make.
The best min_samples_leaf is the one with the lowest validation error; based on the plot, that is min_samples_leaf = 2^8 = 256.
This model is worse than the depth-controlled model.

Problem 3: Random Forests¶

Although scikit has its own implementations of bagging estimators and random forests, here we will build a random forest model manually, for illustration.

Random Forests are bagged collections of decision trees, which select their decision nodes from randomly chosen subsets of the possible features (rather than all features). You can implement this easily in DecisionTreeClassifier using option max_features=n, where n is the number of features to select from. Since you have ~ 14 features, you might choose, say, n=10; smaller values of n will make each tree more random, but may require you to have a high maximum depth to ensure you still fit the data effectively. You'll write a for-loop to build the ensemble members, and another to compute the prediction of the ensemble.

Problem 3.1¶

Using your previous split of the data into training & validation sets, learn a bagged ensemble of decision trees on the training data; then in the next part, you will evaluate the validation performance of the ensemble. (See the pseudocode from lecture slides.) For your individual learners, use parameters that will not constrain the complexity of the resulting tree very much (e.g., set your depth cutoff >= 15, min leaf 1-4, etc.), since the bagging will be used to control overfitting instead.

You will implement bootstrap sampling yourself (again, see pseudocode); simply generate $m'$ data indices uniformly with replacement, using numpy functions or simply rand and conversion to integers. For your bootstrap process, draw the same number of data as in your training set after the validation split (i.e., set $m' = m$, the number of training data).
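As a side note on what bootstrap sampling does (a sketch, not required for the assignment): because indices are drawn with replacement, a bootstrap sample of size $m$ contains only about a $1 - 1/e \approx 63\%$ fraction of the distinct original points; the rest are duplicates. This is easy to confirm numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10000
idx = rng.integers(0, m, size=m)          # m indices drawn with replacement
frac_unique = np.unique(idx).size / m     # fraction of distinct points drawn
print(f"fraction of distinct points: {frac_unique:.3f}")  # ~0.632
```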

Learn at least 50 ensemble members, as you will use them in the next part.

In [20]:
# Bagged ensemble training
bagged_ensemble = []
num_b = 50
m = X_tr.shape[0]

for i in range(num_b):
    indx = np.random.randint(0, m, m)  # Bootstrap sample indices
    X_boot = X_tr[indx]
    Y_boot = Y_tr[indx]

    # max_features=10 restricts each split to a random feature subset,
    # making this a random-forest member rather than plain bagging
    dt = tree.DecisionTreeClassifier(max_depth=15, min_samples_leaf=1, max_features=10,
                                     criterion='entropy', random_state=seed)
    dt.fit(X_boot, Y_boot)
    bagged_ensemble.append(dt)

Problem 3.2¶

Plot the training and validation error of the ensemble as a function of the number of learners $B$ that you include in the ensemble, for (at least) values $B \in \{1,5,10,25,50\}$. (Just use the first $B$ learners that you trained in part 1.)

Remember that your ensemble predicts by making a prediction for each member model, and then taking a vote among the $B$ models. Since your target classes are +1 and -1, as in the slide pseudocode, to find the outcome of the vote you can simply check whether the average of the class predictions is positive or negative.
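One implementation detail worth noting (a sketch; the helper name is my own): `np.sign` returns 0 when an even number of learners split evenly, so it is safer to map non-negative averages to +1 explicitly, matching the tie-breaking convention of preferring class +1:

```python
import numpy as np

def vote(preds):
    """Majority vote over rows of +/-1 predictions; ties go to +1."""
    avg = np.mean(preds, axis=0)
    return np.where(avg >= 0, 1, -1)

# Column 0 is a tie (average 0), which np.sign would map to 0, not +1
v = vote(np.array([[1, -1],
                   [-1, -1]]))
print(v)  # [ 1 -1]
```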

In [22]:
B_values = [1, 5, 10, 25, 50]

train_errors = []
val_errors = []

def ensemble_predict(ensemble, X, B):
    preds = np.array([model.predict(X) for model in ensemble[:B]])
    avg = np.mean(preds, axis=0)
    return np.where(avg >= 0, 1, -1)   # ties vote +1; avoids np.sign(0) == 0

for B in B_values:
    Y_tr_pred = ensemble_predict(bagged_ensemble, X_tr, B)
    Y_val_pred = ensemble_predict(bagged_ensemble, X_val, B)

    train_errors.append(J01(Y_tr, Y_tr_pred))
    val_errors.append(J01(Y_val, Y_val_pred))

plt.figure(figsize=(10,6))
plt.plot(B_values, train_errors, label='Training Error')
plt.plot(B_values, val_errors, label='Validation Error', marker='o')
plt.xlabel('Number of Learners B')
plt.ylabel('Error Rate')
plt.legend()
plt.grid()
plt.show()

Problem 4: Shattering & VC Dimension¶

Consider the following four "datasets", which have two real-valued features $x_1,x_2$:

In [16]:
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
x = np.array([[2,2],[4,8],[6,3],[8,6]])
for i in range(4): 
    axes[i].plot(x[:i+1,0],x[:i+1,1],'ro',ms=8); axes[i].axis([0,10,0,10]); axes[i].set_title(f'({i+1})')
    axes[i].set_xticks(np.unique(x[:,0])); axes[i].set_yticks(np.unique(x[:,1])); 

(Note that no three of the four points are collinear.)

For each of the learners listed below, answer (a) which of the four data sets can be shattered by a learner of that form? Give a brief explanation / justification (1-2 sentences). Then use your results to (b) guess the VC dimension of the classifier (you do not have to give a formal proof, just your reasoning).

In each learner, the parameters $a,b,c,\ldots$ are real-valued scalars, and $T[z]$ is the sign threshold function, i.e., $T[z]=+1$ for $z\geq 0$ and $T[z] = -1$ for $z<0$.

  • $T(\ a + b \ x_1 \ )$
In [ ]:
# (a) which data sets & why?
# Datasets (1) and (2) can be shattered: a single point can be labeled +1 or -1,
# and two points with distinct x1 values can be split either way by choosing the
# threshold -a/b and the sign of b.
# Datasets (3) and (4) cannot: a threshold on x1 labels points monotonically in
# x1, so e.g. the labeling (+1, -1, +1) in x1-order is impossible.

# (b) VC dimension guess?
# The VC dimension is 2: it shatters 2 points but no set of 3.
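To double-check the answer above, a brute-force parameter sweep (my own sketch, not required for the assignment) shows which label patterns $T(a + b\,x_1)$ can realize on the three $x_1$-values of dataset (3). Only the 6 "monotone" patterns ever appear; $(+,-,+)$ and $(-,+,-)$ are unreachable, so the three points are not shattered:

```python
import numpy as np

x1 = np.array([2.0, 4.0, 6.0])   # x1-coordinates of the points in dataset (3)
patterns = set()
for a in np.linspace(-10, 10, 201):      # grid over the offset a
    for b in np.linspace(-5, 5, 101):    # grid over the slope b
        labels = np.where(a + b*x1 >= 0, 1, -1)   # T[z] convention: z>=0 -> +1
        patterns.add(tuple(labels))

print(len(patterns))   # 6 of the 8 possible labelings -> dataset (3) not shattered
```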
  • $T(\ (a*b)\ x_1 + (c/a) \ x_2 \ )$
In [ ]:
# (a) which data sets & why?
# Note (a*b) and (c/a) can be any pair of real weights, so this is a linear
# classifier whose boundary always passes through the origin (no offset term).
# Datasets (1) and (2) can be shattered: a line through the origin can place one
# or two points on either side (all four labelings of dataset (2) are achievable).
# Datasets (3) and (4) cannot: for a boundary through the origin only the angle
# of each point around the origin matters, and a half-plane through the origin
# covers a 180-degree range of angles. Since (2,2) lies angularly between (6,3)
# and (4,8), labeling those two +1 and (2,2) -1 is impossible.

# (b) VC dimension guess?
# The VC dimension is 2: a linear classifier through the origin shatters at most
# 2 points in 2D (in general, d points in d dimensions).
  • $T(\ (x_1-a)^2 + (x_2 -b)^2 - c \ )$
In [ ]:
# (a) which data sets & why?
# Dataset (1): a circle can enclose the point (inside is -1) or miss it (+1).
# Dataset (2): a circle can contain neither, either, or both points, so all four
# labelings are possible.
# Dataset (3): for three non-collinear points, any subset can be enclosed by a
# circle that excludes the rest, so all eight labelings are possible.
# Dataset (4): cannot be shattered; for example, no circle can contain the
# diagonal pair (2,2),(8,6) while excluding both (6,3) and (4,8).

# (b) VC dimension guess?
# The VC dimension is 3: circles shatter 3 points in general position but not 4.
  • $T(\ a + b\,x_1 + c\,x_2 \ ) \times T(\ d + b\,x_1 + c\,x_2 \ )$
    • Hint: the two linear equations correspond to two parallel lines, since both have the same coefficients for $x_1$ and $x_2$. Then, $T(z)\times T(z') = +1$ if and only if $z$ and $z'$ have the same sign.
In [ ]:
# (a) which data sets & why?
# This classifier labels the strip between the two parallel lines -1 and
# everything outside +1 (and setting d = a makes the strip empty, giving +1
# everywhere).
# Datasets (1)-(3) can be shattered: a thin strip around the line through any
# chosen point(s), a wide strip, or an empty strip realizes every labeling.
# Dataset (4) can ALSO be shattered, because the four points are in convex
# position: a thin strip around the line through any two of them captures
# exactly that pair (no three points are collinear); each point is extreme in
# some direction, so a strip can contain any three while excluding the fourth;
# and the all-(+1) / all-(-1) labelings use an empty or very wide strip.

# (b) VC dimension guess?
# The VC dimension is at least 4. In fact 5 points in convex position (e.g., a
# regular pentagon) can be shattered the same way, so a reasonable guess is 5.

Statement of Collaboration (5 points)¶

It is mandatory to include a Statement of Collaboration in each submission, with respect to the guidelines below. Include the names of everyone involved in the discussions (especially in-person ones), and what was discussed.

All students are required to follow the academic honesty guidelines posted on the course website. For programming assignments, in particular, I encourage the students to organize (perhaps using EdD) to discuss the task descriptions, requirements, bugs in my code, and the relevant technical content before they start working on it. However, you should not discuss the specific solutions, and, as a guiding principle, you are not allowed to take anything written or drawn away from these discussions (i.e. no photographs of the blackboard, written notes, referring to EdD, etc.). Especially after you have started working on the assignment, try to restrict the discussion to EdD as much as possible, so that there is no doubt as to the extent of your collaboration.

I have not collaborated with anyone on this assignment.